AGREP(l)                    AGREP(l)                    June 11, 1991

NAME
     agrep - search a file for a string or regular expression, with
     approximate matching capabilities

SYNOPSIS
     agrep [ -#cdehilnpsvwxDIS ] pattern [ filename... ]

DESCRIPTION
     agrep searches the input filenames (standard input is the
     default) for records containing strings which either exactly or
     approximately match a pattern. A record is by default a line,
     but it can be defined differently using the -d option (see
     below). Normally, each record found is copied to the standard
     output. Approximate matching allows finding records that
     contain the pattern with several errors including
     substitutions, insertions, and deletions. For example,
     Massechusets matches Massachusetts with two errors (one
     substitution and one insertion). Running

          agrep -2 Massechusets foo

     outputs all lines in foo containing any string with distance at
     most 2 from Massechusets.

     agrep supports many kinds of queries including arbitrary wild
     cards, sets of patterns, and in general, arbitrary regular
     expressions. See PATTERNS below. It supports most of the
     options supported by the grep family plus several more (but it
     is not 100% compatible with grep).

     For more information on the algorithm used by agrep see Wu and
     Manber, "Fast Text Searching With Errors," Technical report
     #91-11, Department of Computer Science, University of Arizona,
     June 1991 (available by anonymous ftp from cs.arizona.edu
     inside agrep/agrep.tar as agrep.ps).

     As with the rest of the grep family, the characters `$', `^',
     `*', `[', `^', `|', `(', `)', `!', `;', and `\' can cause
     unexpected results when included in the pattern, as these
     characters are also meaningful to the shell. To avoid these
     problems, one should always enclose the entire pattern argument
     in single quotes, i.e., 'pattern'. Do not use double quotes (").

     agrep works only on text (ascii) files. If the file is binary,
     for example, then agrep will generate an error message.
     Only one error message will be generated even if the file list
     contains many binary files.

     When agrep is applied to more than one input file, the name of
     the file is displayed preceding each line which matches the
     pattern. The filename is not displayed when processing a single
     file, so if you actually want the filename to appear, use
     /dev/null as a second file in the list.

OPTIONS
     -#   # is a non-negative integer (at most 8) specifying the
          maximum number of errors permitted in finding the
          approximate matches (defaults to zero). Generally, each
          insertion, deletion, or substitution counts as one error.
          It is possible to adjust the relative cost of insertions,
          deletions and substitutions (see -I -D and -S options).

     -c   Display only the count of matching lines.

     -d 'delim'
          Define delim to be the separator between two records. The
          default value is '$', namely a record is by default a
          line. delim can be a string of size at most 8 (with
          possible use of ^ and $), but not a regular expression.
          Text between two delim's is considered as one record. For
          example, -d '$$' defines paragraphs as records and
          -d '^From ' defines mail messages as records. agrep
          matches each record separately. This option does not
          currently work with regular expressions. delim cannot
          currently contain special control characters.

     -e pattern
          Same as a simple pattern argument, but useful when the
          pattern begins with a `-'.

     -h   Do not display filenames.

     -i   Case-insensitive search - e.g., "A" and "a" are considered
          equivalent.

     -l   List only the files that contain a match.

     -n   Each line that is printed is prefixed by its line number
          in the file.

     -p   Find lines in the text that contain a supersequence of the
          pattern. For example, agrep -p DCS foo will match
          "Department of Computer Science." This option has the same
          function as -I0, which sets the cost of insertion to zero.

     -s   Work silently, that is, display nothing except error
          messages.
          This is useful for checking the error status.

     -v   Inverse mode - display only those lines that do not
          contain the pattern.

     -w   Search for the pattern as a word - i.e., surrounded by
          non-alphanumeric characters. The non-alphanumeric
          characters must surround the match; they cannot be counted
          as errors. For example, agrep -w -1 car will match cars,
          but not characters.

     -x   The pattern must match the whole line.

     -Ik  Set the cost of an insertion to k (k is a non-negative
          integer). This option does not currently work with
          regular expressions.

     -Dk  Set the cost of a deletion to k (k is a non-negative
          integer). This option does not currently work with
          regular expressions.

     -Sk  Set the cost of a substitution to k (k is a non-negative
          integer). This option does not currently work with
          regular expressions.

PATTERNS
     agrep supports a large variety of patterns, including simple
     strings, strings with classes of characters, sets of strings,
     wild cards, and arbitrary regular expressions.

     Strings
          any sequence of characters, including the special symbols
          `^' for beginning of line and `$' for end of line. The
          special characters listed above ( `$', `^', `*', `[', `^',
          `|', `(', `)', `!', and `\' ) should be preceded by `\' if
          they are to be matched as regular characters. For example,
          \^abc\\ corresponds to the string ^abc\, whereas ^abc
          corresponds to the string abc at the beginning of a line.

     Classes of characters
          a list of characters inside [] (in order) corresponds to
          any character from the list. For example, [a-ho-z] is any
          character between a and h or between o and z. The symbol
          `^' inside [] complements the list. For example, [^i-n] is
          the same as [a-ho-z]. The symbol `.' stands for any symbol
          (don't care). The symbol `^' thus has two meanings, but
          this is consistent with egrep.

     Boolean operations
          agrep supports an `and' operation `;' and an `or'
          operation `,', but not a combination of both.
          For example, 'fast;network' searches for all records
          containing both words.

     Wild cards
          The symbol '#' is used to denote a wild card. # matches
          zero or any number of arbitrary characters. For example,
          ex#e matches example. The symbol # is equivalent to .* in
          egrep. In fact, .* will work too, because it is a valid
          regular expression (see below), but unless this is part of
          an actual regular expression, # will work faster.

     Combination of exact and approximate matching
          any pattern inside angle brackets <> must match the text
          exactly even if the match is with errors. For example,
          <mathemat>ics matches mathematical with one error
          (replacing the last s with an a), but mathe<matics> does
          not match mathematical no matter how many errors we allow.

     Regular expressions
          The syntax of regular expressions in agrep is in general
          the same as that for egrep. The union operation `|',
          Kleene closure `*', and parentheses () are all supported.
          Currently '+' is not supported. Regular expressions are
          currently limited to approximately 30 characters
          (generally excluding meta characters). Some options (-d,
          -w, -x, -D, -I, -S) do not currently work with regular
          expressions. The maximal number of errors for regular
          expressions that use '*' or '|' is 4.

EXAMPLES
     agrep -2 -c ABCDEFG foo
          gives the number of lines in file foo that contain ABCDEFG
          within two errors.

     agrep -1 -D2 -S2 'ABCD#YZ' foo
          outputs the lines containing ABCD followed, within
          arbitrary distance, by YZ, with up to one additional
          insertion (-D2 and -S2 make deletions and substitutions
          too "expensive").

     agrep -5 -p abcdefghij /usr/dict/words
          outputs the list of all words containing at least 5 of the
          first 10 letters of the alphabet in order. (Try it: any
          list starting with academia and ending with sacrilegious
          must mean something!)

     agrep -1 'abc[0-9](de|fg)*[x-z]' foo
          outputs the lines containing, within up to one error, the
          string that starts with abc followed by one digit,
          followed by zero or more repetitions of either de or fg,
          followed by either x, y, or z.

     agrep -d '^From ' 'breakdown; (inter|arpa|bit)net' mbox
          outputs all mail messages (the pattern '^From ' separates
          mail messages in a mail file) that contain breakdown and
          one of either internet, arpanet, or bitnet.

     agrep -d '$$' -1 '<word1> <word2>' foo
          finds all paragraphs that contain word1 followed by word2
          with one error in place of the blank. In particular, if
          word1 is the last word in a line and word2 is the first
          word in the next line, then the space will be substituted
          by a newline symbol and it will match. Thus, this is a way
          to overcome separation by a newline. Note that -d '$$' (or
          another delim which spans more than one line) is
          necessary, because otherwise agrep searches only one line
          at a time.

     agrep '^agrep' <this manual>
          outputs all the examples of the use of agrep in this man
          page.

SEE ALSO
     ed(1), ex(1), grep(1V), sh(1), csh(1).

BUGS
     This is the first release of agrep. Expect some bugs,
     especially for more complicated patterns. Any bug reports or
     comments will be appreciated! Please mail them to
     sw@cs.arizona.edu or udi@cs.arizona.edu.

     There may be problems when control characters (e.g., <ctrl>A)
     are used as part of a string or delimiter.

     Regular expressions do not support the '+' operator (match 1 or
     more instances of the preceding token). These can be searched
     for by using this syntax in the pattern: 'pattern(pattern)*'
     (search for strings containing one instance of the pattern,
     followed by 0 or more instances of the pattern).

     agrep sometimes adds an empty line to the output.

     The following can cause an infinite loop:
     agrep pattern * > output_file.
     If the number of matches is high, they may be deposited in
     output_file before it is completely read, leading to more
     matches of the pattern within output_file (the matches are
     against the whole directory). It's not clear whether this is a
     "bug" (grep will do the same), but be warned.

     Patterns are currently limited to approximately 30 characters.
     Lines are limited to 1024 characters. Records are limited to
     8K, and may be truncated if they are larger than that.

DIAGNOSTICS
     Exit status is 0 if any matches are found, 1 if none, 2 for
     syntax errors or inaccessible files.

Formatted: August 24, 1994
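
The exit status values listed under DIAGNOSTICS can be checked from a calling program. The following is a minimal Python sketch, not part of agrep itself; the names STATUS_MEANING, describe_status, and run_and_describe are hypothetical helpers introduced here for illustration, and the only facts taken from this manual are the meanings of statuses 0, 1, and 2.

```python
import subprocess

# Meanings as stated in the DIAGNOSTICS section of this manual.
STATUS_MEANING = {
    0: "match found",
    1: "no match",
    2: "syntax error or inaccessible file",
}


def describe_status(returncode):
    """Map an agrep-style exit status to a short description."""
    return STATUS_MEANING.get(returncode, "unexpected status %d" % returncode)


def run_and_describe(argv):
    """Run the command in argv, discard its output, describe its status.

    argv is an argument list such as ["agrep", "-s", "-1", "pattern", "foo"].
    Raises FileNotFoundError if the command is not installed.
    """
    completed = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return describe_status(completed.returncode)
```

Assuming agrep is installed and on the search path, run_and_describe(["agrep", "-s", "-1", "pattern", "foo"]) would report whether foo contains the pattern within one error. Passing the arguments as a list bypasses the shell, so the quoting concerns described in DESCRIPTION do not apply to this form of invocation.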